For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects are severely occluded. Taking counting as an example, almost all recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). With the power of deep convolutional neural networks (CNNs), counting performance has improved steadily. The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking. Most existing CNN methods produce density maps with a resolution smaller than that of the original images, due to the downsampling strides in the convolution/pooling operations. To produce an original-resolution density map, we also evaluate a classical CNN that uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional (FCNN) adaptation, with skip connections from lower convolutional layers to compensate for the loss of spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared to bilinearly upsampling the lower-resolution density maps. Finally, we also propose several metrics for measuring the quality of a density map, and relate them to experimental results on counting and localization.
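
The integral property of a density map mentioned above can be illustrated with a minimal sketch. It assumes that each annotated object contributes one unit-mass Gaussian, a common construction that this abstract does not specify. The function names `make_density_map` and `count_in_region`, the value of `sigma`, and the example points are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_density_map(points, shape, sigma=2.0):
    """Build a density map with one unit-mass Gaussian per (row, col) point.

    Assumption: Gaussian kernels with a fixed sigma; other kernels are possible.
    """
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    density = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        kernel = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        # Normalize over the image so every object integrates to exactly 1.
        density += kernel / kernel.sum()
    return density


def count_in_region(density, top, left, bottom, right):
    """Estimate the object count in a subregion as the sum (discrete integral)."""
    return float(density[top:bottom, left:right].sum())


# Three hypothetical annotated objects in a 64x64 image.
points = [(10, 10), (12, 40), (50, 20)]
density = make_density_map(points, shape=(64, 64))

total = count_in_region(density, 0, 0, 64, 64)   # whole image
upper = count_in_region(density, 0, 0, 32, 64)   # upper half only
print(f"total count: {total:.3f}, upper-half count: {upper:.3f}")
```

Summing over the whole image recovers the number of annotated objects, while summing over the upper half recovers approximately the two objects located there. The same map therefore supports counting in arbitrary subregions, and its peaks retain the spatial information used for localization.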